Corpus Statistics Empowered Document Classification


Abstract

In natural language processing (NLP), document classification is an important task that relies on the proper thematic representation of documents. Gaussian mixture-based clustering is widespread for capturing rich thematic semantics, but it does not emphasize potential terms in the corpus. Moreover, the soft clustering approach causes long-tail noise by putting every word into every cluster, which affects the thematic representation of documents and their classification. It is more challenging to capture semantic insights when dealing with short-length documents, where word co-occurrence information is limited. In this context, for long texts, we proposed the Weighted Sparse Document Vector (WSDV), which performs clustering on weighted data that emphasizes vital terms and moderates the soft clustering by removing outliers from the converged clusters. Besides the removal of outliers, WSDV utilizes corpus statistics in different steps of the vectorial representation of the document. For short texts, we proposed the Weighted Compact Document Vector (WCDV), which captures better semantic insights in building document vectors by emphasizing potential terms and capturing uncertainty information while measuring the affinity between distributions of words. Using available corpus statistics, WCDV sufficiently handles the data sparsity of short texts without depending on external knowledge sources. To evaluate the proposed models, we performed multiclass document classification using standard performance measures (precision, recall, f1-score, and accuracy) on three long-text and two short-text benchmark datasets, on which the models outperform some state-of-the-art models. The experimental results demonstrate that, in long-text classification, WSDV reached 97.83% accuracy on the AgNews dataset, 86.05% accuracy on the 20Newsgroup dataset, and 98.67% accuracy on the R8 dataset. In short-text classification, WCDV reached 72.7% accuracy on the SearchSnippets dataset and 89.4% accuracy on the Twitter dataset.
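The general idea of weighting terms by corpus statistics and pruning weak soft-cluster memberships can be illustrated with a minimal sketch. This is not the authors' WSDV implementation: the function names, the smoothed IDF formula used as the "corpus statistic", the fixed membership threshold used as a stand-in for outlier removal, and the toy word vectors and cluster probabilities are all assumptions made for illustration. The cluster probabilities are assumed to come from a Gaussian mixture fitted elsewhere.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed inverse document frequency per term (an assumed corpus statistic)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def sparsify(probs, threshold):
    """Zero out soft-cluster memberships below the threshold (no renormalization)."""
    return [p if p >= threshold else 0.0 for p in probs]


def document_vector(doc, word_vecs, cluster_probs, idf, threshold=0.2):
    """Build one block per cluster: the IDF- and membership-weighted sum of the
    document's word vectors, averaged over the number of known words."""
    n_clusters = len(next(iter(cluster_probs.values())))
    dim = len(next(iter(word_vecs.values())))
    out = [0.0] * (n_clusters * dim)
    known = [w for w in doc if w in word_vecs]
    for w in known:
        for k, p in enumerate(sparsify(cluster_probs[w], threshold)):
            if p == 0.0:
                continue  # pruned membership contributes nothing
            for d, x in enumerate(word_vecs[w]):
                out[k * dim + d] += idf[w] * p * x
    n = max(len(known), 1)
    return [v / n for v in out]


# Toy, hand-made inputs (hypothetical; real ones would come from trained
# word embeddings and a fitted Gaussian mixture model).
docs = [["goal", "match", "team"], ["stock", "market", "team"], ["goal", "market"]]
word_vecs = {
    "goal": [1.0, 0.0], "match": [0.9, 0.1], "team": [0.5, 0.5],
    "stock": [0.0, 1.0], "market": [0.1, 0.9],
}
cluster_probs = {
    "goal": [0.95, 0.05], "match": [0.9, 0.1], "team": [0.5, 0.5],
    "stock": [0.05, 0.95], "market": [0.1, 0.9],
}

idf = idf_weights(docs)
sparse_vec = document_vector(docs[0], word_vecs, cluster_probs, idf, threshold=0.2)
dense_vec = document_vector(docs[0], word_vecs, cluster_probs, idf, threshold=0.0)
```

With `threshold=0.0` every word contributes to every cluster, which is the long-tail behaviour the abstract describes; raising the threshold drops the weak memberships so each word only influences the clusters it plausibly belongs to.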


Similar resources

Corpus Statistics in Text Classification of Online Data

The transformation of Machine Learning (ML) from a boutique science to a generally accepted technology has increased the importance of reproduction and transportability of ML studies. In the current work, we investigate how corpus characteristics of textual data sets correspond to text classification results. We work with two data sets gathered from sub-forums of an online health-related forum. Our emp...


A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...


Document Analysis And Classification Based On Passing Window

In this paper, we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction, and document classification. A document image is enhanced in preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...


Visualization of Text Document Corpus

From the automated text processing point of view, natural language is very redundant in the sense that many different words share a common or similar meaning. For a computer, this can be hard to understand without some background knowledge. Latent Semantic Indexing (LSI) is a technique that helps in extracting some of this background knowledge from a corpus of text documents. This can be also viewed...


Useful statistics for corpus linguistics

• frequencies of occurrence of linguistic elements, which can be studied from two different perspectives:
  o how frequent are morphemes or words or patterns/constructions in (parts of) a corpus? This information can be provided in various different forms of frequency lists;
  o how evenly are morphemes or words or patterns/constructions distributed across (parts of) a corpus? This information can ...


Journal

Journal title: Electronics

Year: 2022

ISSN: 2079-9292

DOI: https://doi.org/10.3390/electronics11142168